Practical: epidemic curves, delays and transmission chains

Introduction

On the tidyverse

The tidyverse is a collection of R packages designed for data science. Developed with high standards of coding practices and software development, these packages form a consistent ecosystem to handle and visualise data. In this practical, we will mostly rely on the following packages:

  • dplyr for handling data, making new variables, building summaries, number-crunching

  • tidyr to re-arrange tidy/data

  • ggplot2 to visualise data

  • forcats to handle categorical data (factors)

  • magrittr for the piping operator (%>%)

Most of these functionalities are summarised in handy cheatsheets. We provide links to the most relevant ones below:



Other packages

Other packages we will use include:

Cheatsheets and other resources:

  • the rmarkdown cheatsheet

  • the knitr website documenting many options used in rmarkdown’s settings and code chunks



On the RECONverse

Several RECON packages will be used in the practical:

  • incidence2 to build and visualise epidemic curves

  • epicontacts to visualise and analyse contact tracing data or transmission chains

  • epitrix to estimate delay distributions

  • distcrete to handle delay distributions



What we will do today

This practical will use different outbreak data to illustrate how the various packages mentioned above can be combined to explore outbreak data and carry out analyses specific to the outbreak response context. Building on material seen in previous sessions, we will particularly look at:

  • building epidemic curves with various time intervals and groups
  • derive empirical distributions of key delays
  • fit delays to discretized Gamma distributions using Maximum Likelihood, and summarising the results
  • visualise and analyse transmission chains / contact tracing data

In addition, we strongly recommend storing all analyses into one or several rmarkdown reports.



How this document works

This practical will take you through initial exploratory analyses of different datasets. Some parts will be used merely to illustrate a technical point (e.g. how to count cases by different groups), some will focus more on the epidemiological meaning of a result, and some will do both. These analyses are merely meant as a guide, and are by no means an exhaustive exploration of the data. Be curious: feel free to ask new questions and try to answer them. In all instances, try to first answer your own questions: try/error is a great way to learn R. But if you feel stuck, do not hesitate to questions to the demonstrators: they are here to help.

In this practical, most command lines are provided, but not all. In the few instances where code is not provided, you will typically need to re-use and adapt previous code to produce the result requested. In most cases, copy-paste should be sufficient to reproduce the results.

However, it is important that you understand what the code does. Look at relevant documentation, try tweaking the code, experiment, and do not hesitate to ask questions. It is not essential to go through all examples to illustrate important points, so take your time, especially if you are new to R. Also note that the solutions provided are by not means the only options for solving a given problem - feel free to explore alternatives and discuss them with the trainers.



Data: Measles in Hagelloch, Germany, 1861

In this second part, we continue the analysis of the measles outbreak which took place in Hagellock, Germany, in 1861 (Oesterle 1993 ; Neal and Roberts 2004). The linelist data is distributed in the R package outbreaks, as the dataset measles_hagelloch_1861 (see ?measles_hagelloch_1861 for more information).

We first load required packages and the data, which we save in an object x for convenience.


# load packages
library(tidyverse) # this loads all tidyverse packages
library(outbreaks) # for datasets
library(incidence2) # for epicurves
library(epicontacts) # for contacts / transmission chains



# load the data, convert to tibble and store as 'x'
x <- measles_hagelloch_1861 %>%
  tibble()

# dimensions of the dataset
dim(x)
## [1] 188  12

# check the content of x
x
## # A tibble: 188 × 12
##    case_ID infector date_of_prodrome date_of_rash date_of_death   age gender
##      <int>    <int> <date>           <date>       <date>        <dbl> <fct> 
##  1       1       45 1861-11-21       1861-11-25   NA                7 f     
##  2       2       45 1861-11-23       1861-11-27   NA                6 f     
##  3       3      172 1861-11-28       1861-12-02   NA                4 f     
##  4       4      180 1861-11-27       1861-11-28   NA               13 m     
##  5       5       45 1861-11-22       1861-11-27   NA                8 f     
##  6       6      180 1861-11-26       1861-11-29   NA               12 m     
##  7       7       42 1861-11-24       1861-11-28   NA                6 m     
##  8       8       45 1861-11-21       1861-11-26   NA               10 m     
##  9       9      182 1861-11-26       1861-11-30   NA               13 m     
## 10      10       45 1861-11-21       1861-11-25   NA                7 f     
## # … with 178 more rows, and 5 more variables: family_ID <int>, class <fct>,
## #   complications <fct>, x_loc <dbl>, y_loc <dbl>
summary(x)
##     case_ID          infector      date_of_prodrome      date_of_rash       
##  Min.   :  1.00   Min.   :  1.00   Min.   :1861-10-30   Min.   :1861-11-03  
##  1st Qu.: 47.75   1st Qu.: 31.00   1st Qu.:1861-11-23   1st Qu.:1861-11-27  
##  Median : 94.50   Median : 50.50   Median :1861-12-01   Median :1861-12-05  
##  Mean   : 94.50   Mean   : 82.15   Mean   :1861-11-29   Mean   :1861-12-03  
##  3rd Qu.:141.25   3rd Qu.:153.00   3rd Qu.:1861-12-04   3rd Qu.:1861-12-07  
##  Max.   :188.00   Max.   :188.00   Max.   :1862-01-24   Max.   :1862-01-27  
##                   NA's   :4                                                 
##  date_of_death             age          gender     family_ID     class 
##  Min.   :1861-11-18   Min.   : 0.000   f   :83   Min.   : 1.00   0:90  
##  1st Qu.:1861-12-12   1st Qu.: 3.000   m   :94   1st Qu.:18.00   1:30  
##  Median :1861-12-13   Median : 7.000   NA's:11   Median :31.00   2:68  
##  Mean   :1861-12-12   Mean   : 6.989             Mean   :30.95         
##  3rd Qu.:1861-12-15   3rd Qu.:11.000             3rd Qu.:43.00         
##  Max.   :1861-12-28   Max.   :15.000             Max.   :69.00         
##  NA's   :176                                                           
##  complications     x_loc           y_loc      
##  yes:188       Min.   :  7.5   Min.   :  5.0  
##                1st Qu.:145.0   1st Qu.: 80.0  
##                Median :182.5   Median :147.5  
##                Mean   :188.2   Mean   :133.8  
##                3rd Qu.:240.0   3rd Qu.:187.5  
##                Max.   :280.0   Max.   :240.0  
## 

# ... use glimpse and View to explore content

Try using the functions glimpse, View, summary on x. What variables does the dataset contain?



Building an epicurve

Here, we show how incidence2 can be used to build epicurves using various groups and time intervals.

In its simplest form, incidence() builds a tibble reporting daily case incidence:

However, using different time intervals or adding groups is straightforward. For instance:

Group information can be displayed using colors (fill argument), facets (using facet_plot), or both, so that up to 2 groups can be used; for instance:

See ?plot.incidence2 for more plotting options.

Tips to customise graphs

  • Using custom colors: nice colors can go a long way towards communicating results; one way to customise colors is to use HTML color codes, for instance using this tool; all ‘hex’ codes e.g. #777284 can be used in R; also note there are professional designer tools to create consistent color palettes, such as paletton.com (see palettonr for importing these palettes)

  • Specifying data formats: date formats in R use key words, e.g. %d for the digit of the day, %B for the full month, etc.; details on these keywords are hidden in the documentation of ?strptime


Exercise: The graphs produced when plotting incidence objects are regular ggplot2 objects, so they can be customised as seen before. Use what you have learnt in previous sessions to:

  • create a new variablenew_gender by recoding the gender variable into male/female/missing
  • similarly, create new_class and recode the class variable in preschool/class_1/class_2
  • define a scale for dates using day/month/year e.g. 01/12/1831 for the 1st december 1831
  • use a custom color palette for the new gender variable
  • repeat the previous plot using these custom scales and the new variables




Estimating delays

R is good at handling dates, and is able to count the number of days separating dates. This makes calculating delays pretty straightforward. The only ‘trick’ to bear in mind is to convert the time difference to integer, as difference between dates is originally returned as a difftime object, which can behave in unexpected ways in further operations.

We illustrate this by calculating the delay between the onset of symptoms and the appearance of a rash:

This empirical distribution is informative, but only reflects a sample of cases. Using this data, what can we say, more generally, about the delay from onset of symptoms of measles to the onset of rash in this population?

We can try and address this question by fitting a discretised Gamma distribution to the data, using epitrix’s function fit_disc_gamma:


Question: compare the results of the fitted distribution to the empirical summary. What do you notice?


Let us examine the object delay_rash_fit. Contains parameters of the distribution, the corresponding log-likelihood, indications of convergence of the fitting process, as well as a distcrete object, which is essentially a list containing 4 functions providing the typical implementation of distributions in R:

  • $d: probability mass function
  • $q: quantiles
  • $p: p-values
  • $r: random number generator

We can assess the overall quality of the fit by comparing the empirical and fitted distributions; note the use of complete to make sure we define probabilities for delays that were not observed in the data:


Question: compare the two distributions; are you satisfied with the fit to the data?

Exercise:

  • use the r function to generate random samples from the distribution, using the same number of cases, and examine the resulting empirical distribution using qplot() to build quick plots; repeat this a few times; does a peak as observed in the empirical distribution seem likely?

  • What are the expected percentages of delays greater than 10 days? Same question for 5 days?


## [1]  0.1687248 16.9420068


Estimating severity

There is no outcome variable in the data, but date of death is documented for some cases. Assuming the absence of date of death indicates survival, we can build a new variable outcome using:

  • is.na(), which returns TRUE for all missing entries of a vector, and FALSE otherwise

  • if_else which will turn TRUE and FALSE into other values


Exercise: using the function prop_ci from the previous practical, calculate the 95% confidence interval of the case fatality ratio.




Visualising transmission chains

In this section, we will use epicontacts to visualise and analyse transmission chains. Technically, transmission chains can be represented as graphs where nodes are cases, and edges represent transmisison events. Note that the same principle applies to any kind of contacts, so that this tool can be used for contact tracing as well.

The main function make_epicontacts (see documentation ?make_epicontacts) requires two types of data inputs:

  • a linelist, describing features of the cases (the ‘nodes’), including a unique case identifier (number or label)

  • contacts, describing the transmission/contacts between these cases, with at least two columns indicating the source case (‘from’) and the recipient (‘to’); note that other columns can be used to provide information on the contact, e.g. the type of contact or the relationship between the cases

Often in epidemiological datasets, contacts information is recorded as a field in the linelist. In the present dataset, these fields are:

  • case_ID: the unique ID of the cases; indicates the recipient, so corresponds to to

  • infector: the unique ID of their infector; corresponds to from

Here, we first separate the linelist and contact data, and then build an epicontacts object and visualise it using plot:


# isolate information on the contacts
x_contacts <- x %>%
  select(infector, case_ID)
x_contacts
## # A tibble: 188 × 2
##    infector case_ID
##       <int>   <int>
##  1       45       1
##  2       45       2
##  3      172       3
##  4      180       4
##  5       45       5
##  6      180       6
##  7       42       7
##  8       45       8
##  9      182       9
## 10       45      10
## # … with 178 more rows

# make the epicontacts object
x_chains <- make_epicontacts(
  linelist = x, # linelist data
  contacts = x_contacts, # contact data
  id = "case_ID", # unique ID in linelist
  from = "infector", # 'from' column
  to = "case_ID", # 'to' column
  directed = TRUE) # transmission events are directed

x_chains
## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 188 cases in linelist; 188 contacts;  directed 
## 
##   // linelist
## 
## # A tibble: 188 × 16
##       id infector date_of_prodrome date_of_rash date_of_death   age gender
##    <int>    <int> <date>           <date>       <date>        <dbl> <fct> 
##  1     1       45 1861-11-21       1861-11-25   NA                7 f     
##  2     2       45 1861-11-23       1861-11-27   NA                6 f     
##  3     3      172 1861-11-28       1861-12-02   NA                4 f     
##  4     4      180 1861-11-27       1861-11-28   NA               13 m     
##  5     5       45 1861-11-22       1861-11-27   NA                8 f     
##  6     6      180 1861-11-26       1861-11-29   NA               12 m     
##  7     7       42 1861-11-24       1861-11-28   NA                6 m     
##  8     8       45 1861-11-21       1861-11-26   NA               10 m     
##  9     9      182 1861-11-26       1861-11-30   NA               13 m     
## 10    10       45 1861-11-21       1861-11-25   NA                7 f     
## # … with 178 more rows, and 9 more variables: family_ID <int>, class <fct>,
## #   complications <fct>, x_loc <dbl>, y_loc <dbl>, new_gender <fct>,
## #   new_class <fct>, delay_rash <int>, outcome <chr>
## 
##   // contacts
## 
## # A tibble: 188 × 2
##     from    to
##    <int> <int>
##  1    45     1
##  2    45     2
##  3   172     3
##  4   180     4
##  5    45     5
##  6   180     6
##  7    42     7
##  8    45     8
##  9   182     9
## 10    45    10
## # … with 178 more rows

plot(x_chains)

Question: examine the interactive graph; note that you can zoom in/out using your mouse, reposition nodes by dragging them across, and see linelist data associated to cases by hovering the cursor above a node. What does this tell you about how the outbreak went on? Were there super-spreading events?


To help with the latter question, we can derive the distribution of the case reproduction numbers (R):

By default, the plot method of epicontacts object uses vis_epicontacts to produce interactive graphs. Look at the associated documentation ?vis_epicontacts and interpret the following commands and plot:


Question: examine the interactive graph; note that you can zoom in/out using your mouse, reposition nodes by dragging them across, and see linelist data associated to cases by hovering the cursor above a node. What does this tell you about how the outbreak went on? Were there super-spreading events?

Exercise:

  • adapt the code provided above to show families instead of classes as node colors; compare the resulting plots; can you postulate which transmission events took place at school or at home?




Estimating the serial interval


Exercise:

Combine two types of analyses you have carried out so far to estimate the serial interval (delay between symptom onest of primary and secondary cases).

  1. extract the empirical distribution of the serial interval from the transmission chains

  2. fit this distribution to a discretised Gamma

Results should match the outputs below:


## # A tibble: 184 × 1
##       si
##    <int>
##  1    10
##  2    12
##  3     9
##  4    10
##  5    11
##  6     9
##  7     9
##  8    10
##  9    11
## 10    10
## # … with 174 more rows
## $mu
## [1] 10.89177
## 
## $cv
## [1] 0.1500888
## 
## $sd
## [1] 1.634733
## 
## $ll
## [1] -352.9525
## 
## $converged
## [1] TRUE
## 
## $distribution
## A discrete distribution
##   name: gamma
##   parameters:
##     shape: 44.3918886018654
##     scale: 0.245355063017598
## # A tibble: 21 × 4
##       si     n   prop    proba
##    <int> <dbl>  <dbl>    <dbl>
##  1     0     0 0      1.94e-30
##  2     1     0 0      8.42e-19
##  3     2     0 0      1.05e-12
##  4     3     0 0      7.14e- 9
##  5     4     0 0      2.80e- 6
##  6     5     0 0      1.81e- 4
##  7     6     0 0      3.37e- 3
##  8     7     4 0.0217 2.49e- 2
##  9     8    19 0.103  9.03e- 2
## 10     9    32 0.174  1.86e- 1
## # … with 11 more rows


Question:

Compare the estimated serial interval to the delay between the last two cases. What does it suggest? Using arrange() combined with desc(), sort x by decreasing date of symptom onset to identify the last case. How many days has there been between the last 2 cases? Using the estimated serial interval, calculate the probability of a delay as long as this (or longer).




Transmission chains: further investigations

To complement what the visual inspection of transmission chains, it can be useful to further characterise transmission patterns e.g. in terms of gender, age classes, or locations. This requires, for each transmission pair, looking up features of the respective cases in the linelist, and reporting them. This can be done using get_pairwise, illustrated below using the new_gender variable:

This is a start, but not entirely satisfying, as we should be able to build a 2x2 table to summarise these patterns. If you look at ?get_pairwise, you will find we can specify what is done with the attributes of the infector and recipient by defining the function f. This permits a more meaningful output:

Here, the Chi-square test, which can be used to test whether two categorical variables are randomly associated (the null hypothesis), or on the contrary seem to be linked (the alternative hypothesis, in which case a low p-value is expected), suggests that there are no gender patterns in transmission.

We repeat this analysis with classes, and will add a graphical representation of the results:


Question: was transmission independent to the class structure? Which class seeded most infections? Which transmission routes were the most frequent?

Exercise:

  • adapt the code provided above to look at family patterns as well as age patterns; the resulting graphs are produced below



Exercise:

  • examine the code below and try to understand what the results represent

  • considering these results together with the ones above, what can you say about patterns of transmissions in this Measles outbreak?



Exercise (advanced):

  • examine the code below and interpret the results; what kind of statistical method is used here? what does it show?

  • apply the same analysis using the class; what are your final conclusions?




References

Neal, Peter J, and Gareth O Roberts. 2004. “Statistical Inference and Model Selection for the 1861 Hagelloch Measles Epidemic.” Biostatistics 5 (2): 249–61.

Oesterle, Heike. 1993. “Statistische Reanalyse Einer Masernepidemie 1861 in Hagelloch.” PhD thesis, Uitgever niet vastgesteld.